OPTIMAL SAMPLING IN SELECTION PROBLEMS*
by 

Shanti  S.  Gupta 

Purdue University

A selection procedure typically consists of three
ingredients: (1) a sampling rule, (2) a stopping rule, and
(3) a decision rule, though these components are not usually
explicitly so labeled. The problem of optimal sampling
arises in different ways depending on the context of the
problem at hand. Broadly speaking, the problem of optimal
(or optimum) sampling arises because of the need to
balance the cost of sampling against the cost of
making a wrong decision. Obviously, increasing the amount
of sampling increases the former cost while decreasing the
latter.

1.  Indifference  Zone  Formulation

Suppose we have k independent populations $\pi_1, \pi_2, \ldots, \pi_k$,
where the CDF of $\pi_i$ is $F(x; \theta_i)$ and the parameter $\theta_i$ has
an unknown value belonging to an interval $\Theta$ on the real
line. Our goal is to select the population associated with
the largest $\theta_i$, which is called the best population. In the
Indifference Zone Formulation of Bechhofer [2], it is
required that the selection rule guarantee, with probability
at least P* ($1/k < P^* < 1$), that the best
population will be chosen whenever the true parametric
configuration $\theta = (\theta_1, \ldots, \theta_k)$ lies in a subset $\Omega_\Delta$ of the
parametric space $\Omega$ characterized by the property that the
distance between the best and the next best populations is
at least $\Delta$. The subset $\Omega_\Delta$ is called the Preference Zone.
The constants P* and $\Delta$ are specified in advance by the
experimenter. The probability guarantee requirement is
referred to as the P*-requirement.

Now, let us consider k independent normal populations
$\pi_1, \pi_2, \ldots, \pi_k$ with unknown means $\mu_1, \mu_2, \ldots, \mu_k$, respectively,
and common known variance $\sigma^2$. Based on samples of size n
from each population, the single-stage procedure of
Bechhofer [2] for selecting the population with the largest
$\mu_i$ selects the population that yields the largest sample
mean. Here the preference zone is defined by the relation
$\mu_{[k]} - \mu_{[k-1]} \ge \Delta$, where $\mu_{[1]} \le \mu_{[2]} \le \cdots \le \mu_{[k]}$ denote the ordered
$\mu_i$. The optimum sampling problem in this case is to
determine the minimum sample size n subject to the P*-requirement.
The optimum value of n is given by the
smallest integer n for which

$$\int_{-\infty}^{\infty} \Phi^{k-1}\left(x + \frac{\sqrt{n}\,\Delta}{\sigma}\right) \varphi(x)\,dx \ge P^*,$$

where $\Phi$ and $\varphi$ denote the CDF and the density function of a
standard normal random variable.
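As an illustration of how the smallest such n might be found numerically, the following is a minimal sketch, not a procedure taken from Bechhofer [2]. It assumes the inequality $\int \Phi^{k-1}(x + \sqrt{n}\,\Delta/\sigma)\,\varphi(x)\,dx \ge P^*$, approximates the integral by Simpson's rule on [-8, 8], and searches n = 1, 2, ... in turn. The function names, the integration range, and the step count are illustrative assumptions.

```python
import math


def std_normal_cdf(x):
    # Phi(x), expressed through the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    # phi(x), the standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def prob_correct_selection(k, n, delta, sigma, lo=-8.0, hi=8.0, steps=2000):
    """Simpson approximation of the integral of
    Phi^(k-1)(x + sqrt(n)*delta/sigma) * phi(x) over [lo, hi].
    `steps` must be even; the range [-8, 8] is an assumed truncation."""
    shift = math.sqrt(n) * delta / sigma
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * std_normal_cdf(x + shift) ** (k - 1) * std_normal_pdf(x)
    return total * h / 3.0


def min_sample_size(k, p_star, delta, sigma, n_max=100000):
    """Smallest integer n whose probability bound reaches p_star."""
    for n in range(1, n_max + 1):
        if prob_correct_selection(k, n, delta, sigma) >= p_star:
            return n
    raise ValueError("no n <= n_max meets the P*-requirement")


if __name__ == "__main__":
    # For k = 2 the integral reduces to Phi(sqrt(n/2)*delta/sigma).
    print(min_sample_size(k=2, p_star=0.95, delta=1.0, sigma=1.0))
```

For k = 2 the integral has the closed form $\Phi(\sqrt{n/2}\,\Delta/\sigma)$, which gives a convenient check on the numerical integration.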

Suppose that these normal distributions have unknown
and possibly unequal variances. In this case, no single-stage
procedure exists that satisfies the P*-requirement. Two-stage
procedures have been studied in this situation by Bechhofer,
Dunnett, and Sobel [4], and Dudewicz and Dalal [9]. One may
take a sample of size $n_0$ from each population at the first stage
and, on the basis of the information obtained from these
samples, determine the sizes of additional samples to be
taken from these populations. The selection rule is based
on the total samples from all the populations. Even when
the variances are known, one may use a two-stage procedure
in which the first stage involves selection of a nonempty
subset of random size with possible values 1, 2, ..., k.
If the first stage results in a subset of size larger than
1, then a second stage ensues with additional samples from
those populations that still remain under consideration.
Such procedures have been considered by Alam [1], Tamhane
and Bechhofer [20], [21], and, with some modifications, by
Gupta and Miescke [15]. A problem of optimum sampling in these
cases is to determine the optimal combination of the sample
sizes in the two stages. This can be done, for example
(Tamhane and Bechhofer [20]), by minimizing the maximum of
the expected total sample size for the experiment over all
parametric configurations subject to the P*-requirement.

2.  Minimax,  Gamma  Minimax  and  Bayes  Techniques 

Consider again k normal populations $\pi_1, \pi_2, \ldots, \pi_k$ with
unknown means $\mu_1, \mu_2, \ldots, \mu_k$ and common known variance $\sigma^2$.
If the selection procedure is to take samples of size n from
these populations and choose the population that yields the
largest sample mean, one can consider a loss function

$$L = c_1 k n + c_2 \sum_{i=1}^{k} \left(\mu_{[k]} - \mu_i\right) I_i,$$

where $c_1$ is the cost of sampling
per observation, $c_2$ is a positive constant, and $I_i = 1$ if
$\pi_i$ is selected, and $= 0$ otherwise. Optimum n can be
obtained by minimizing the integrated risk assuming (known)
prior distributions for the $\mu_i$'s; see Dunnett [10]. One may
also determine the optimum n by minimizing the maximum
expected loss over all parametric configurations. However,
the expected loss in our case is unbounded above and we can
find a minimax solution if we have prior information
regarding the bounds on the differences $\mu_{[k]} - \mu_{[i]}$,
$i = 1, \ldots, k-1$.

Suppose we take a sample of size $n_1$ from each of k
normal populations with unknown means $\mu_1, \mu_2, \ldots, \mu_k$, and
common known variance $\sigma^2$. For a fixed t, $1 \le t \le k-1$, we
discard the populations that produced the t smallest sample
means and take an additional sample of size $n_2$ from each
of the remaining k-t populations. We select as the best
the population that entered the second stage and produced
the largest sample mean based on all $n_1 + n_2$ observations.
Given that the total sample size $T = k n_1 + (k-t) n_2$ is a
constant, the problem is to determine the optimum allocation
of $(n_1, n_2)$ by minimizing the maximum expected loss,
where the loss is $L = c_1 T + c_2 \sum_{i=1}^{k} (\mu_{[k]} - \mu_i) I_i$, with $I_i$ as defined
earlier. For details see Somerville [19], and Fairweather
[11].
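The mechanics of this two-stage elimination procedure can be sketched as follows. This is only an illustration under stated assumptions, not the minimax computation of the cited papers: `feasible_allocations` enumerates the integer pairs $(n_1, n_2)$ with $k n_1 + (k-t) n_2 = T$ (assuming both are at least 1), and `two_stage_select` applies the discard-and-resample rule to given data. Both function names and the `draw_stage2` callback are hypothetical.

```python
def feasible_allocations(k, t, total):
    """All (n1, n2) with n1 >= 1, n2 >= 1 and k*n1 + (k - t)*n2 == total."""
    pairs = []
    for n1 in range(1, total // k + 1):
        remainder = total - k * n1
        if remainder > 0 and remainder % (k - t) == 0:
            pairs.append((n1, remainder // (k - t)))
    return pairs


def two_stage_select(stage1, draw_stage2, t, n2):
    """Apply the two-stage elimination rule.

    stage1      : list of k lists, each holding n1 first-stage observations
    draw_stage2 : hypothetical callback (i, n2) -> list of n2 new
                  observations from population i
    t           : number of populations discarded after stage one
    Returns (index of selected population, sorted surviving indices).
    """
    k = len(stage1)
    if not 1 <= t <= k - 1:
        raise ValueError("t must satisfy 1 <= t <= k - 1")
    first_means = [sum(obs) / len(obs) for obs in stage1]
    # Order populations by first-stage mean and drop the t smallest.
    order = sorted(range(k), key=lambda i: first_means[i])
    survivors = order[t:]
    best, best_mean = None, None
    for i in survivors:
        combined = list(stage1[i]) + list(draw_stage2(i, n2))
        mean = sum(combined) / len(combined)  # mean of all n1 + n2 observations
        if best is None or mean > best_mean:
            best, best_mean = i, mean
    return best, sorted(survivors)


if __name__ == "__main__":
    print(feasible_allocations(k=3, t=1, total=14))
```

Choosing among the feasible pairs would still require evaluating the maximum expected loss for each one, which is not attempted here.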


In these problems, we can also take the gamma-minimax
approach and minimize the maximum expected risk over a
specified class of prior distributions for the parameters
$\mu_i$; see Gupta and Huang [14].


3.  Comparison  with  a  Control 


An optimal sampling problem can be, as we have seen,
an optimal allocation problem. Such allocation problems
are also meaningful when we compare several treatments with
a control. Let $\pi_1, \pi_2, \ldots, \pi_k$ be k independent normal populations
representing the experimental treatments and let $\pi_0$
be the control, which is also a normal population. Let $\pi_i$
have unknown mean $\mu_i$ and known variance $\sigma_i^2$, $i = 0, 1, \ldots, k$.
A multiple comparisons approach is to obtain one- and two-sided
simultaneous confidence intervals for, say,
$\mu_i - \mu_0$, $i = 1, 2, \ldots, k$. If $n_i$ is the size of the sample from
$\pi_i$, $i = 0, 1, \ldots, k$, such that $\sum_{i=0}^{k} n_i = N$, a fixed integer,
then the problem is to determine the optimal allocation of
the total sample size. The optimal allocation will depend,
besides other known quantities, on a specified 'yardstick'
associated with the width of the interval. For details of
these problems see Bechhofer [3], Bechhofer and Nocturne
[5], Bechhofer and Tamhane [6], and Bechhofer and
Turnbull [7].
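To give a feel for such an allocation problem, here is a minimal sketch under a deliberately simpler criterion than the one used in the cited papers. Instead of the joint confidence coefficient, it minimizes the sum of the variances of the k differences $\bar{X}_i - \bar{X}_0$, namely $\sum_{i=1}^{k} (\sigma_i^2/n_i + \sigma_0^2/n_0)$, subject to $\sum_{i=0}^{k} n_i = N$. A Lagrange-multiplier argument gives $n_i \propto \sigma_i$ for the treatments and $n_0 \propto \sqrt{k}\,\sigma_0$ for the control. The function name is hypothetical and the sample sizes are left unrounded.

```python
import math


def variance_minimizing_allocation(sigma_control, sigma_treatments, total):
    """Continuous allocation minimizing sum_i Var(Xbar_i - Xbar_0).

    This is a stand-in criterion (an assumption of this sketch), not the
    confidence-coefficient criterion of the cited papers.
    Returns [n_0, n_1, ..., n_k] as floats summing to `total`.
    """
    k = len(sigma_treatments)
    # Control weight carries the factor sqrt(k) because its variance
    # term appears in all k differences.
    weights = [math.sqrt(k) * sigma_control] + list(sigma_treatments)
    scale = total / sum(weights)
    return [w * scale for w in weights]


if __name__ == "__main__":
    # Four treatments, equal standard deviations, N = 60 observations.
    print(variance_minimizing_allocation(1.0, [1.0] * 4, 60))
```

With equal variances this reduces to taking $\sqrt{k}$ times as many observations on the control as on each treatment.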


Instead of taking the above multiple comparisons
approach, one can use the formulation of partitioning the
set of k experimental populations into two sets, one
consisting of populations that are better than the control
and the other consisting of the remaining populations (those
worse than the control). For a given total sample size, the
problem is to determine the optimal allocation either by
minimizing the expected number of populations misclassified
or by maximizing the probability of a correct decision; for
details see Sobel and Tong [18].

4.  Subset  Selection  Approach 

As before, consider k independent populations
$\pi_1, \pi_2, \ldots, \pi_k$, where $\pi_i$ is characterized by the CDF
$F(x; \theta_i)$, $i = 1, \ldots, k$. In the subset selection approach,
we are interested in selecting a nonempty subset of the k
populations so that the selected subset will contain the
population associated with the largest $\theta_i$ with a guaranteed
minimum probability P*. The number of populations to be
selected depends on the outcome of the experiment and is
not fixed in advance as in the indifference zone approach.

Suppose we take a random sample of size n from each
population. Let $T_i$, $i = 1, \ldots, k$, be suitably chosen
statistics from these samples. In the case of location
parameters, the procedure of Gupta [12], [13] selects $\pi_i$
if and only if $T_i \ge T_{\max} - D$, where $T_{\max} = \max(T_1, \ldots, T_k)$
and $D > 0$ is to be chosen such that the P*-requirement is
met. The constant D will depend on k, P*, and n. Unlike
in the indifference zone approach, we can obtain a rule for
any given n satisfying the P*-condition.
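The selection step of this rule is simple enough to state directly in code. The following minimal sketch takes the statistics $T_1, \ldots, T_k$ and a constant D as given (how D is obtained is a separate computation) and returns the indices of the retained populations. The function name and the 0-based indexing are illustrative choices.

```python
def gupta_subset(statistics, d_const):
    """Indices i (0-based) with T_i >= T_max - D.

    The subset is never empty, since the population attaining T_max
    always satisfies the inequality.
    """
    if d_const <= 0:
        raise ValueError("D must be positive")
    t_max = max(statistics)
    return [i for i, t in enumerate(statistics) if t >= t_max - d_const]


if __name__ == "__main__":
    # A larger D keeps more populations in the selected subset.
    print(gupta_subset([1.0, 2.5, 3.0, 2.9], 0.6))
    print(gupta_subset([1.0, 2.5, 3.0, 2.9], 0.05))
```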

In the case of k normal populations with unknown means
$\mu_1, \mu_2, \ldots, \mu_k$ and known common variance $\sigma^2$, the rule of
Gupta [12] selects $\pi_i$ if and only if $\bar{X}_i \ge \bar{X}_{\max} - d\sigma/\sqrt{n}$,
where $\bar{X}_i$ is the mean of a sample of size n from $\pi_i$,
$i = 1, 2, \ldots, k$. The constant d is given by the equation

$$\int_{-\infty}^{\infty} \Phi^{k-1}(x + d)\, \varphi(x)\,dx = P^*.$$
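One way the constant d might be computed is by bisection on the left-hand side of this equation, which increases from 1/k at d = 0 toward 1 as d grows. The sketch below is an illustration under that observation, not the tabulation method of Gupta [12]; it uses Simpson's rule on [-8, 8] for the integral, and the function names and tolerance are assumptions.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def coverage(k, d, lo=-8.0, hi=8.0, steps=2000):
    """Simpson approximation of the integral of Phi^(k-1)(x + d) * phi(x)."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += weight * std_normal_cdf(x + d) ** (k - 1) * std_normal_pdf(x)
    return total * h / 3.0


def solve_d(k, p_star, tol=1e-10):
    """Bisection for d with coverage(k, d) = p_star, where 1/k < p_star < 1."""
    low, high = 0.0, 1.0
    while coverage(k, high) < p_star:  # expand until the root is bracketed
        high *= 2.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if coverage(k, mid) < p_star:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


if __name__ == "__main__":
    # For k = 2 the equation reduces to Phi(d / sqrt(2)) = P*.
    print(solve_d(k=2, p_star=0.95))
```

Note that d obtained this way depends only on k and P*; the sample size enters the rule through the factor $\sigma/\sqrt{n}$.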


The expected subset size, denoted by E(S), is given by

$$E(S) = \sum_{i=1}^{k} \int_{-\infty}^{\infty} \prod_{j \ne i} \Phi\left\{x + d + \frac{\sqrt{n}}{\sigma}\left(\mu_{[i]} - \mu_{[j]}\right)\right\} \varphi(x)\,dx,$$

where $\mu_{[1]} \le \mu_{[2]} \le \cdots \le \mu_{[k]}$ denote the ordered $\mu_i$. We
can define the optimum sample size as the minimum sample
size for which the expected subset size or, equivalently,
the expected proportion of the populations selected does not
exceed a specified bound when the true parametric configuration
is of a specified type. Relevant tables are
available in Gupta [13] for the equidistant configuration
given by $\mu_{[i+1]} - \mu_{[i]} = \delta$, $i = 1, 2, \ldots, k-1$, and in Deely
and Gupta [8] for the slippage configuration given by
$\mu_{[1]} = \cdots = \mu_{[k-1]} = \mu_{[k]} - \delta$.
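The optimum sample size in this sense can be approximated numerically rather than read from tables. The following is a minimal sketch under stated assumptions: the probability that a population with mean $\mu_i$ is selected is taken to be $\int \prod_{j \ne i} \Phi\{x + d + (\sqrt{n}/\sigma)(\mu_i - \mu_j)\}\,\varphi(x)\,dx$, E(S) is the sum of these probabilities, the constant d is supplied by the caller, and the search is carried out in the slippage configuration with a bound greater than 1. The function names and the Simpson integration on [-8, 8] are illustrative choices.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def expected_subset_size(means, d, n, sigma, lo=-8.0, hi=8.0, steps=2000):
    """E(S) as the sum over i of the selection probability of population i."""
    scale = math.sqrt(n) / sigma
    h = (hi - lo) / steps
    total = 0.0
    for i, mu_i in enumerate(means):
        for s in range(steps + 1):
            x = lo + s * h
            weight = 1 if s in (0, steps) else (4 if s % 2 else 2)
            prod = 1.0
            for j, mu_j in enumerate(means):
                if j != i:
                    prod *= std_normal_cdf(x + d + scale * (mu_i - mu_j))
            total += weight * prod * std_normal_pdf(x)
    return total * h / 3.0


def min_n_for_subset_bound(k, d, delta, sigma, bound, n_max=10000):
    """Smallest n with E(S) <= bound in the slippage configuration
    (k - 1 means equal to 0 and one mean equal to delta)."""
    means = [0.0] * (k - 1) + [delta]
    for n in range(1, n_max + 1):
        if expected_subset_size(means, d, n, sigma) <= bound:
            return n
    raise ValueError("bound not reached for n <= n_max")


if __name__ == "__main__":
    print(min_n_for_subset_bound(k=3, d=2.0, delta=1.0, sigma=1.0, bound=1.5))
```

When all the means are equal, each population is selected with probability P*, so E(S) = kP* regardless of n; this provides a simple consistency check.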


If we use the restricted subset selection approach, in
which the size of the selected subset is random subject to
a specified upper bound, then the P*-condition is met whenever
the parametric configuration belongs to a preference
zone as in the case of Bechhofer's formulation. In this
case, the minimum sample size (assuming common sample size)
can be determined in a similar way (Gupta and Santner [17]).

In our discussion so far, optimal sampling has referred
to optimal sample sizes or optimal allocation under a
given sampling scheme such as single-stage, two-stage, etc.
One can also seek the optimal sampling scheme by comparing
single-stage, multi-stage and sequential procedures.
Comparisons of different sampling schemes for several
selection goals have been made and are available in the
literature. In addition to the usual sampling schemes,
inverse sampling rules with different stopping rules and
comparisons involving vector-at-a-time sampling and
Play-the-Winner sampling schemes have been studied in the case of
clinical trials involving dichotomous data. References to
these and other problems discussed can easily be obtained
from Gupta and Panchapakesan [16].